Emma Rand. (2022). Data Analysis in R (BIO00017C) 2020: 2022 (v1.1). Zenodo. https://doi.org/10.5281/zenodo.6359475
In this introduction you will start working with RStudio. You will type in some data, perform some calculations on it and plot it.
By actively following the materials and carrying out the independent study before and after the contact hours the successful student will be able to:
Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. Do not be put off and don’t let what you can not do interfere with what you can do. You will benefit from collaborating with others and/or discussing your results. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. You may wish to refer to the independent study materials for reference.
These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.
is something you need to do on your computer. It may be opening programs or documents or locating a file.
is something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.
is something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.
is a question for you to think about and answer. You will usually want to record your answers in your script for future reference.
Artwork by @allison_horst
Start RStudio from the Start menu.
My RStudio Anatomy may be a useful reference.
Go to the Files tab in the lower right pane and click on the three dots on the right. This will open a “Go to folder” window. Navigate to a place on your computer (or University account if using the VDS) where you keep your work. Click Open.
Also on the Files tab click on New Folder. In the box that appears type “data-analysis-in-r”. This will be the folder that we work in throughout the Data Analysis in R part of 17C.
Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None) and choosing New Project, then New Directory, then New Project. Name the RStudio Project ‘workshop1’.
Make a new script then save it with a name like analysis.R to carry out the rest of the work.
We will work with some data on the number of males in 64 bird nests with a clutch size of 5. You are going to type the data into R, summarise it and plot it.
The data are given as a frequency table:

| No. males | No. nests |
|---|---|
| 0 | 4 |
| 1 | 13 |
| 2 | 14 |
| 3 | 15 |
| 4 | 13 |
| 5 | 5 |
You will create a figure like this:
Start by making a vector n that holds the numbers 0 to 5.
Write the following in your script:
# the number of males in a clutch of five
n <- 0:5
Remember, the shortcut for <- is ALT+-
Notice I have used a comment. Comment your code as much as possible!
Ensure your cursor is on the line with the command and do CTRL+ENTER to send the command to the console to be executed.
Examine the ‘structure’ of the n object using str()
str(n)
##  int [1:6] 0 1 2 3 4 5
It’s a vector of 6 integers.
Create a vector called freq containing the numbers of nests with 0 to 5 males and examine it with str().
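One way to do this, sketched here with the counts from the frequency table, is to combine the six values with c(). The vector n is re-created so the example runs on its own.

```r
# the number of males in a clutch of five
n <- 0:5

# the number of nests with 0, 1, 2, 3, 4 and 5 males
freq <- c(4, 13, 14, 15, 13, 5)

# freq should be a vector of 6 numbers, one for each value of n
str(freq)
```

The order of the values in freq matters: the first value belongs to 0 males, the second to 1 male, and so on.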
Check that sum(freq) gives the answer you expect:
# the total number of nests
sum(freq)
## [1] 64
We have frequencies so to find the mean number of males per nest we need the total number of males:
| No. males | No. nests | No. males × No. nests |
|---|---|---|
| 0 | 4 | 0 |
| 1 | 13 | 13 |
| 2 | 14 | 28 |
| 3 | 15 | 45 |
| 4 | 13 | 52 |
| 5 | 5 | 25 |
| Total | 64 | 163 |
So the mean is: \[\frac{163}{64} = 2.55\]
Let us do this in R.
Calculate the total number of nests:
total_nests <- sum(freq)
Notice we have assigned the value to a variable that we will be able to use later.
Calculate the total number of males:
total_males <- sum(n * freq)
Calculate the mean number of males per nest:
total_males / total_nests
## [1] 2.546875
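As a cross-check, and only as a sketch, the same mean can be reached in two other ways in base R: by expanding the frequency table back into 64 individual nests with rep(), or by using weighted.mean() with the frequencies as weights. The names males_per_nest, mean_expanded, mean_weighted and mean_from_totals are chosen here for illustration and are not part of the workshop.

```r
n <- 0:5
freq <- c(4, 13, 14, 15, 13, 5)

# repeat each value of n as many times as it was observed,
# giving one entry per nest
males_per_nest <- rep(n, times = freq)
length(males_per_nest)

# mean of the expanded data
mean_expanded <- mean(males_per_nest)

# mean of n, weighted by how often each value occurred
mean_weighted <- weighted.mean(n, w = freq)

# the calculation used in the workshop
mean_from_totals <- sum(n * freq) / sum(freq)

c(mean_expanded, mean_weighted, mean_from_totals)
```

All three values should agree, which is a useful check that freq was typed in correctly.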
R works ‘elementwise’ unlike most programming languages.
n * freq gives
\[\begin{bmatrix}0\\1\\2\\3\\4\\5\end{bmatrix}\times\begin{bmatrix}4\\13\\14\\15\\13\\5\end{bmatrix}=\begin{bmatrix}0\\13\\28\\45\\52\\25\end{bmatrix}\]
It was designed to make it easy to work with data.
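A short sketch of what ‘elementwise’ means in practice, using the same n and freq vectors. Each operation is applied to every element without writing a loop. The names males_by_class and nests_over_2 are made up for this example.

```r
n <- 0:5
freq <- c(4, 13, 14, 15, 13, 5)

# multiply matching elements: 0 * 4, 1 * 13, 2 * 14, ...
males_by_class <- n * freq
males_by_class

# a single number is applied to every element
n + 1

# comparisons are elementwise too, giving one TRUE or FALSE per element
n > 2

# use that to add up the nests with more than 2 males
nests_over_2 <- sum(freq[n > 2])
nests_over_2
```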
ggplot()
Commands like c(), sum(), and str() are part of the ‘base’ R system.
Base packages (collections of commands) always come with R.
Other packages, such as ggplot2 (Wickham, 2016), need to be added. ggplot2 is one of the tidyverse (Wickham, Averick, et al., 2019) packages.
Added packages need only be installed once but must be loaded each R session.
📢 If you are working on a University Computer or the VDS you do not need to install tidyverse. 📢
If you are working on your own computer or using RStudio cloud you will need to install tidyverse.
To install a package:
Go to the Packages tab on the lower right pane. Click Install and type tidyverse into the box that appears.
Wait until you get the prompt back. It will take a few moments, so be patient!
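If you prefer typing to clicking, a package can also be installed from the console with install.packages(). The sketch below wraps this in a small helper, install_if_missing(), which is a name made up here (it is not a base R or tidyverse function), so that nothing is downloaded when the package is already present.

```r
# hypothetical helper: install a package only if it is not already installed
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# on your own computer you would run this once:
# install_if_missing("tidyverse")
```

The call is left commented out so that running the script does not trigger a download by accident.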
To load a package which you have already installed we use the library() function. You will need to do this wherever you are working.
Load the tidyverse:
library(tidyverse)
You will likely be warned of some function name conflicts but these will not be a problem for you.
ggplot() takes a dataframe as an argument.
We can make a dataframe of the two vectors, n and freq, using the data.frame() function.
Make a dataframe called nest_data:
nest_data <- data.frame(n = factor(n), freq)
n was made into a factor (a categorical variable) because there are only 6 values and I want to make a bar plot.
Check the structure of nest_data.
Click on nest_data in the Environment to open a spreadsheet-like view of it.
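A minimal sketch of checking the dataframe from the script rather than by clicking. It rebuilds n, freq and nest_data so it runs on its own.

```r
n <- 0:5
freq <- c(4, 13, 14, 15, 13, 5)
nest_data <- data.frame(n = factor(n), freq)

# the structure: how many rows and columns, and the type of each column
str(nest_data)

# n is now a factor; its levels are the six possible numbers of males
levels(nest_data$n)

# the column names come from the names used in data.frame()
names(nest_data)
```

You should see that nest_data has 6 observations of 2 variables, with n stored as a factor and freq as numbers.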
Create a simple barplot using ggplot like this:
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col()
ggplot() alone creates a blank plot.
ggplot(data = nest_data) looks the same.
aes() gives the ‘aesthetic mappings’: how variables (columns) are mapped to visual properties (aesthetics), e.g., axes, colour, shapes.
Thus…
ggplot(data = nest_data, aes(x = n, y = freq)) produces a plot with axes
geom_col(): a ‘geom’ (geometric object) gives the visual representation of the data: points, lines, bars, boxplots etc.
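The build-up described above can be sketched by saving each stage as an object and printing them one at a time. The object names (p_blank, p_data, p_axes, p_bars) are invented for this illustration.

```r
library(ggplot2)  # ggplot2 is also attached by library(tidyverse)

n <- 0:5
freq <- c(4, 13, 14, 15, 13, 5)
nest_data <- data.frame(n = factor(n), freq)

# a blank plot: no data, no axes
p_blank <- ggplot()

# data supplied but nothing mapped yet: still looks blank
p_data <- ggplot(data = nest_data)

# variables mapped to x and y: axes appear but nothing is drawn
p_axes <- ggplot(data = nest_data, aes(x = n, y = freq))

# adding a geom draws the bars
p_bars <- p_axes + geom_col()
p_bars
```

Typing the name of any of these objects, for example p_axes, displays that stage in the Plots tab.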
Note that ggplot2 is the name of the package and ggplot() is its most important command.
‘Arguments’ can be added to the geom_col() command inside the brackets.
Commands do something and their arguments (in brackets) can specify:
Many arguments have defaults so you don’t always need to supply them.
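To see defaults for yourself, one option (a sketch, not something the workshop requires) is the base R function formals(), which lists a command's arguments along with any default values. The name arg_names is made up for the example.

```r
library(ggplot2)

# the names of the arguments geom_col() accepts
arg_names <- names(formals(geom_col))
arg_names

# the default used for position when you do not supply one
formals(geom_col)$position
```

The manual page gives the same information with explanations, so formals() is best treated as a quick reminder rather than a replacement for reading the manual.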
Open the manual page for geom_col() using:
?geom_col
The manual page has several sections.
... means etc. and includes arguments that can be passed to many ‘geoms’.
Change the fill of the bars using fill:
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue")
Colours can be given by their name, “lightblue”, or code, “#ADD8E6”.
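The link between a colour name and its code can be checked in base R. This sketch uses col2rgb() to get the red, green and blue components of a named colour and rgb() to turn them into a hex code. The names rgb_values and hex_code are chosen for the example.

```r
# red, green and blue components (0 to 255) of a named colour
rgb_values <- col2rgb("lightblue")
rgb_values

# convert those components to a hex code
hex_code <- rgb(rgb_values["red", ],
                rgb_values["green", ],
                rgb_values["blue", ],
                maxColorValue = 255)
hex_code

# a few of the colour names R knows
head(colours())
```

Either form can be given to fill, so fill = "lightblue" and fill = "#ADD8E6" produce the same bars.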
Change the bars to a colour you like.
fill is one of the arguments covered by .... fill is an ‘aesthetic’. If you look for ... in the list of arguments you will see it says:
Other arguments passed on to layer(). These are often aesthetics, used to set an aesthetic to a fixed value, like colour = “red” or size = 3. They may also be parameters to the paired geom/stat.
Further down the manual, there is a section on Aesthetics which lists those understood by geom_col()
We can set (map) the fill aesthetic to a particular colour inside geom_col() or map it to a variable inside the aes() instead.
Map the fill aesthetic to the n variable:
ggplot(data = nest_data, aes(x = n, y = freq, fill = n)) +
  geom_col()
Mapping fill to a variable means the colour varies for each value of n.
Use the manual to put the bars next to each other. Look for the argument that will mean there is no space between the bars.
Use the manual to change the colour of the lines around each bar to black.
Top Tip
Make your code easier to read by using white space and new lines: put spaces around = and -> and after ,
We can make changes to the axes using:
scale_x_discrete()
scale_y_continuous()
ggplot automatically extends the axes slightly. You can turn this behaviour off with the expand argument in scale_x_discrete() and scale_y_continuous().
Remove the gap between the axes and the data:
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue",
           width = 1,
           colour = "black") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0))
Each ‘layer’ is added to the ggplot() command with a +
Look up scale_x_discrete in the manual and work out how to change the axis title from “n” to “Number of Males”. Also change the y-axis title.
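Once you have tried this yourself, one possible solution is sketched below using the name argument of the scale functions. The y-axis title “Number of Nests” and the object name nest_plot are assumptions made for this example; the workshop does not specify them.

```r
library(ggplot2)

n <- 0:5
freq <- c(4, 13, 14, 15, 13, 5)
nest_data <- data.frame(n = factor(n), freq)

nest_plot <- ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue",
           width = 1,
           colour = "black") +
  scale_x_discrete(name = "Number of Males",
                   expand = c(0, 0)) +
  scale_y_continuous(name = "Number of Nests",  # assumed y-axis title
                     expand = c(0, 0))
nest_plot
```

Saving the plot as an object is optional. It makes it easy to display the figure again later by typing nest_plot.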
You’re finished!
Artwork by @allison_horst
Please note that next week’s work assumes you have carried out the independent study.
Read 1.1 On the psychology of statistics section from Dani Navarro’s Learning statistics with R. It is the first part of Chapter 1 Why do we learn statistics? Approx 5 minutes.
Watch Getting help in RStudio which explains how to bring up the manual pages and understand them. Being able to use the manual is a threshold concept in R. You will get a feel for the structure and pattern of commands much more quickly if you make a habit of briefly reading the manual for the commands you are using.
These contain all the code needed in the workshop even where it is not visible on the webpage.
Rmd file The Rmd file is the file I use to compile the practical. Rmd stands for R markdown. It allows R code and ordinary text to be interwoven to produce well-formatted reports including webpages. If you right-click on the link and choose Save-As, you will be able to open the Rmd file in RStudio. Alternatively, View in Browser.
Plain script file This is a plain script (.R) version of the practical generated from the Rmd. Again, you can save this and open it in RStudio. Alternatively, View in Browser.
Pages made with rmarkdown (Allaire, Xie, et al., 2019a; Xie, Allaire, et al., 2018a), kableExtra (Zhu, 2019a), RefManager (McLean, 2014)
Allaire, J., Y. Xie, et al. (2019a). rmarkdown: Dynamic Documents for R. R package version 1.16. URL: https://github.com/rstudio/rmarkdown.
McLean, M. W. (2014). Straightforward Bibliography Management in R Using the RefManager Package. arXiv: 1403.2036 [cs.DL]. URL: https://arxiv.org/abs/1403.2036.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN: 978-3-319-24277-4. URL: https://ggplot2.tidyverse.org.
Wickham, H., M. Averick, et al. (2019). “Welcome to the tidyverse”. In: Journal of Open Source Software 4.43, p. 1686. DOI: 10.21105/joss.01686.
Xie, Y., J. Allaire, et al. (2018a). R Markdown: The Definitive Guide. ISBN 9781138359338. Boca Raton, Florida: Chapman and Hall/CRC. URL: https://bookdown.org/yihui/rmarkdown.
Zhu, H. (2019a). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. URL: https://CRAN.R-project.org/package=kableExtra.